DETAILED ACTION

1. Claims 1, 3, 7-18 and 36-45 are pending. 

2. Applicant's amendment filed on 08/27/2008 is acknowledged. 

3. Claims 36-43 and 45 stand withdrawn from further consideration pursuant to 37 CFR
1.142(b) as being drawn to a nonelected species, there being no allowable generic or linking
claim. Election was made without traverse in the reply filed on 11/16/2007.

4. Claims 1, 3, 7-18 and 44 are currently under consideration as they read on an allergen 
hybrid protein having reduced allergenicity but retaining immunogenicity, comprising SEQ ID 
NO: 11 of the Ves v 5 allergen protein and a scaffold protein that is structurally homologous to
the Ves v 5 allergen protein, wherein the peptide epitope sequence is 9-45 amino acids in length 
and the hybrid protein has a native conformation and the peptide epitope sequence is present in a 
surface accessible region of the hybrid protein corresponding to its position in the allergen 
protein and wherein the peptide epitope sequence is substituted for the scaffold sequence and the 
allergen protein is selected from insect allergens. 

5. In view of the amendments filed on 08/27/2008, only the following rejections are
maintained.

6. The following is a quotation of the first paragraph of 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of making 
and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it 
pertains, or with which it is most nearly connected, to make and use the same and shall set forth the best mode 
contemplated by the inventor of carrying out his invention. 

7. Claims 1, 3, 7-18 and 44 stand rejected under 35 U.S.C. 112, first paragraph, because the
specification, while being enabling for: allergen hybrid proteins of wasp antigen 5 wherein
peptide epitopes greater than 8 amino acids in length obtained from a wasp antigen 5 protein of 
one species are used to replace the corresponding peptide epitopes in a wasp antigen 5 protein of 
a different wasp species and the resulting hybrid molecule demonstrates reduced allergenicity but 
maintains immunogenicity and maintains the conformation present in a wild-type wasp antigen 5 
protein, does not reasonably provide enablement for: an allergen hybrid protein comprising a
scaffold protein substituted with a peptide epitope sequence from an allergen protein that is 
structurally homologous to said scaffold protein, said hybrid protein has a native 
conformation when compared to said scaffold wherein the scaffold protein has at least 30 
percent sequence identity to the allergen protein from which the peptide epitope sequence is 
derived, wherein the peptide epitope sequence is 9 to 45 amino acids in length and is present in 
a surface accessible loop or corner region of the hybrid protein corresponding to its position in
the allergen protein, and wherein the allergen protein is selected from the group consisting of 
grass group 5 pollen allergens, grass group 2 allergens, dust mite group 1 allergens, dust 
mite group 2 allergens, Asterales group 5 allergens, Fagales group 1 allergens, Fagales 
group 2 allergens, Vespid antigen 5 allergens, and Bee venom group 2 allergens of claim 1; 
wherein the scaffold protein has at least 50 percent sequence identity to the allergen from 
which the peptide epitope sequence is derived of claim 3; wherein the peptide epitope 
sequence is 9 to about 35 amino acids in length of claim 7; wherein the peptide epitope
sequence is 9 to about 25 amino acids in length of claim 8; wherein the peptide epitope 
sequence is 9 to about 15 amino acids in length of claim 9; which is a hybrid vespid venom 
allergen protein of claim 12; which is a hybrid vespid venom antigen 5 protein of claim 13; 
wherein the peptide epitope sequence is from the genus Vespula and the scaffold protein is 
from the genus Polistes of claim 14; wherein the peptide epitope sequence is from the species 
vulgaris of claim 15; wherein the scaffold protein is from the species annularis of claim 16; a 
hybrid vespid venom antigen 5 allergen comprising scaffold sequence derived from a first 
vespid venom antigen 5 allergen comprising a substitution of a first epitope of said scaffold 
sequence with a second epitope derived from a second vespid venom antigen 5 allergen, said 
second epitope selected from the group consisting of NNYCKIKCLKGGVHTACK (SEQ ID: 2);
NNYCKIKCLKGGVHTACKYGSLKP (SEQ ID: 3); 
NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVV (SEQ ID: 4); 
NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQ (SEQ ID: 5); 
NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQEKQDILK (SEQ ID: 6); 
QVGQNVALTGSTAAKYDDPVKLVKMWEDEVKDYNPKKKFSGNDFLKTG (SEQ ID NO: 
7); HYTQMVWANTKEVGCGSIKYIQEKWHKHYLVCNYOPSGNFKNEELYQTK 
(SEQ ID NO: 8); LKPNCGNKVVV (SEQ ID NO: 9); LTGSTAAKYDD (SEQ ID NO: 10); 
PKKKFSGND (SEQ ID NO: 11); FKNEELYQTK (SEQ ID NO: 13);
NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQEKQDILKEHND 
(SEQ ID NO: 93); 

NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQEKQDILKEHND 
FRQKIAR (SEQ ID NO: 94); and

NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQEKQDILKEHND 
FRQKIARGLETRGNPGPQPPAKNMKN (SEQ ID NO: 95) and wherein said hybrid vespid 
venom antigen 5 allergen has a native conformation of claim 17; wherein the peptide epitope 
sequence comprises a conservative amino acid change of claim 18 and wherein the peptide 
epitope sequence is from an allergen selected from the group consisting of Dol m 5, Dol a 5, 
Pol a 5, 5, Pol e 5, Pol f 5, Pol m 5, Vesp c 5, Vesp m 5, Ves f 5, Ves g 5, Ves m 5, Ves p 5, Ves
s 5, Ves vi 5, and Ves v 5. The specification does not enable any person skilled in the art to
which it pertains, or with which it is most nearly connected, to make and/or use the invention
commensurate in scope with this claim for the same reasons as set forth in the Office Action 
mailed on 02/27/2008. 



Applicant's arguments filed on 08/27/2008 have been fully considered, but are not found 
persuasive. 



Applicant argues: 



"The specification enables the full scope of the claims. At the time the application was filed, grass 
groups 5 and 2 pollen allergens, house dust mite groups 1 and 2 allergens, Asterales group 5 allergens, 
Fagales groups 1 and 2 allergens, Vespid antigen 5 allergens, and bee venom group 2 allergens were well- 
characterized classes of proteins. Multiple members of each of these allergen families had been isolated. 
See specification at Table 8, beginning at page 100. Three dimensional structures had been published for 
the Fagales group 1 allergen, Bet v 1 and Bet v 2 (profilin), house dust mite group 2 allergen, Der p 2,
Vespidae antigen 5 allergen, Ves v 5. Additionally, at the time the application was filed, a robust 3-D
structure of the house dust mite group 1 allergen, Der p 1, had been published. Topham et al, 1994, Protein 
Eng. 7:869-894 (Attached at Tab A). Additionally, at the time the application was filed, the 3-D structures 
for grass pollen 5 (as evidenced by the Protein Data Bank Accession number 1L3P), for grass pollen 2 (as 
evidenced by the Protein Data Bank Accession number 1WHO and 1WHP), several Asterales pollen (e.g.,
ragweed) allergens (as evidenced by the Protein Data Bank Accession numbers 1BBG, 2BBG, and 3BBG), 
and bee venom group 2 antigen Api m 1 (Protein Data Bank Accession 1FCQ ) were also known. Thus, at 
the time the application was filed, a 3-D structure had been established for at least one member of each of 
the classes of allergens called for in the claims. 



Application/Control Number: 10/091,135 
Art Unit: 1644 



Page 6 



One of ordinary skill in the art would have been able to use the known three dimensional and 
sequence information available for at least the grass group 5 allergens, grass group 2 allergens, dust mite 
group 1 allergens, dust mite group 2 allergens, Asterales group 5 allergens, Fagales group 1 allergens, 
Fagales group 2 allergens, Vespid antigen 5 allergens, and Bee venom group 2 antigen allergens to practice 
the claimed invention without undue experimentation. 

The claims have also been amended to be directed to hybrid proteins wherein the scaffold protein 
and the epitope exhibit at least 30% sequence identity. Support for this amendment may be found at least 
on pages 9-10 of the specification. (See claim 1.) It was well-known at the time the application was filed, 
that homologous proteins with greater than or equal to 30% identity and similar functions have closely 
similar 3-D structures. 

At the time the application was filed, one of ordinary skill could routinely identify homologous 
regions between proteins using standard alignment and homology programs, such as the programs Gap, 
Bestfit and BLAST. See specification at page 18, lines 11-14. Having aligned homologous proteins, the
instant specification provides guidance to choose a surface accessible portion, preferably a loop or corner
region. See specification at page 10, lines 18-20 and page 17, lines 8-11. As stated in the specification:

Switching corresponding regions of homologous proteins, especially in surface accessible, e.g., 
loop and corner, regions predictably conserve native structure. Surface accessible regions especially loop
and corner regions, tend to demonstrate more flexibility and better tolerate changes while retaining
structure. This approach also finds a counterpart in directed evolution, where homologous enzymes are 
recombined to yield novel, function enzyme chimeras. 

Specification at page 17, lines 8-13. Upon identifying regions of protein that are to be "switched," 
standard molecular biology techniques can be used to construct the hybrid proteins. See specification at 
pages 23-31. Lastly, the specification provides guidance for expressing and purifying hybrid constructs 
having a native structure, e.g., by expressing hybrid protein in insect cells or, preferably, in yeast, e.g., 
Pichia pastoris. Specification at pages 31-36, particularly page 36, lines 7-12. The specification states
explicitly that, "these expression systems should yield "native" glycosylation and structure, particularly
secondary and tertiary structure, of the expressed polypeptide." Specification at page 36, lines 10-12.

Thus, the specification provides all of the guidance required to identify epitopes for substitution, 
construct nucleic acids encoding hybrid proteins, and then express the hybrid proteins in native form. In 
view of the extensive guidance in the specification, there is a high likelihood that a particular hybrid protein 
made according to the specification will include a B cell epitope and be expressed in native form. 

The specification further provides extensive guidance on methods to test the relevant properties of 
hybrid proteins. Native conformation may be assessed, for example, by circular dichroism (CD)
spectroscopy, NMR spectroscopy, neutron diffraction and fluorescence spectroscopy. Specification at page
23, lines 1-14. Lastly, using precisely the methods set forth in the specification, Applicants successfully
constructed over one dozen hybrid allergens having native structure and reduced allergenicity and retaining
immunogenicity.

For the reasons set forth above, at the time the application was filed, one of ordinary skill in the art 
would have been able to make and use the claimed invention without undue experimentation. 
Reconsideration of the claims and withdrawal of the rejection thereof for lack of enablement under 35 
U.S.C. §112, first paragraph is respectfully requested. " 
It is the Examiner's position that the specification does not adequately disclose any
"allergen hybrid protein"; any "peptide epitope sequence"; any "scaffold protein"; any "scaffold 
sequence"; any "allergen protein"; any "grass pollen allergens"; any "mite allergens"; any "weed 
pollen allergens"; any "tree pollen allergens"; any "insect allergens"; any "hybrid vespid venom 
allergen protein"; any "hybrid vespid venom allergen 5 protein"; any "second epitope derived 
from a second vespid venom antigen 5 allergen"; any "scaffold sequence derived from a first 
vespid venom antigen 5 allergen".

The specification also does not adequately disclose any scaffold protein that has "at least 
30 percent sequence identity to the allergen" or "at least 50 percent sequence identity to the 
allergen." Contrary to Applicant's assertion, homology of 30% does not equate to functional or 
three dimensional structure equivalence between two proteins. Attwood et al. teaches that "[i]t
is presumptuous to make functional assignments merely on the basis of some degree of similarity
between sequences" (PTO-892; Reference U; In particular, whole document). Similarly, Skolnick
et al. teaches that the skilled artisan is well aware that assigning functional activities for any 
particular protein or protein family based upon sequence homology is inaccurate, in part because 
of the multifunctional nature of proteins (PTO-892; Reference V; In particular, Abstract and 
Sequence-based approaches to function prediction, page 34). Even in situations where there is 
some confidence of a similar overall structure between two proteins, only experimental research 
can confirm the artisan's best guess as to the function of the structurally related protein (In 
particular, Abstract and Box 2). The claim recitations encompass scaffold proteins having at 
least 30% sequence identity and at least 50% sequence identity to the allergen protein from 
which the peptide epitope is derived. However, sequence identity over the entire allergen protein
is not relevant because the claims only encompass a peptide epitope sequence. The allergen
protein and the scaffold protein may have 0% sequence identity in the region of the peptide 
epitope sequence, though they have sequence identity in other regions. Alternatively, the 
scaffold protein and the allergen protein may be the same protein. In such a case, the allergen 
hybrid would encompass native protein. Whether or not the recited allergens were known and 
used at the time of invention is not persuasive since the claims encompass as yet unknown 
allergens and derivatives of known and unknown allergens. Using those allergens would be 
unpredictable even though the recited allergens were known. A skilled artisan could not 
reasonably identify "structurally homologous" proteins for use as scaffolds in the instant 
invention because structural homology requires knowledge of the three-dimensional shape of a 
protein, yet the claims read on hybrid allergens for which the three-dimensional structure is not 
known. 
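
The distinction drawn above between sequence identity over the entire protein and sequence identity within the peptide epitope region can be illustrated with a minimal sketch. The two sequences below are hypothetical toy strings, not sequences from the application, and are assumed to be already aligned with no gaps; percent_identity is an illustrative helper and is not a function of Gap, Bestfit, BLAST or any other program named in the record.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences carry a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Hypothetical, pre-aligned toy sequences (assumption: illustration only).
# The first 10 residues are identical; the last 10 differ at every position.
allergen = "ACDEFGHIKL" + "MNPQRSTVWY"
scaffold = "ACDEFGHIKL" + "YWVTSRQPNM"

overall = percent_identity(allergen, scaffold)                  # whole protein
epitope_region = percent_identity(allergen[10:], scaffold[10:])  # last 10 residues

print(f"overall identity: {overall:.0f}%")
print(f"identity in the chosen region: {epitope_region:.0f}%")
```

In this toy case the two sequences meet a 30 percent (and a 50 percent) overall identity threshold while sharing no residues in the region chosen for substitution, which is the situation the paragraph above describes.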

The claims encompass any allergen including as-yet undiscovered allergens and 
derivatives thereof having any number of additions, substitutions and deletions and the allergen 
hybrid protein may comprise a peptide epitope sequence from another unrelated allergen. 
Alternatively, the scaffold protein need not be an allergen at all, so long as it comprises a peptide 
epitope sequence derived from any allergen. Therefore, the recitation of a "scaffold protein" 
encompasses all native, derivative and synthetic proteins. The specification has also not 
adequately disclosed the genus of all "scaffold proteins" and "scaffold sequences." The term 
"scaffold" does not limit these recited proteins and sequences at all. All proteins comprise 
sequence that determines the three dimensional structure. So, the fact that a sequence that 
determines the three dimensional structure of a protein is recited is not sufficiently disclosed in
the specification. 

The specification has not adequately disclosed peptide epitope sequences "in a loop or
corner region of the hybrid protein." Contrary to Applicant's assertion, a skilled artisan would be
required to perform undue experimentation to determine whether or not the peptide epitope 
sequence of the resulting hybrid protein is in a loop or corner region as the specification has not 
adequately disclosed the structure of the genus of scaffold proteins encompassed by the instant 
claim recitation. The specification has not adequately disclosed the genus of all proteins having 
a 9 to 45 amino acid substitution such that the substitution is in a loop or corner region of the 
protein, as encompassed by the instant claims. 

The examples of the specification do not adequately provide enablement for one of 
ordinary skill in the art to make and use the genus of hybrid proteins wherein the peptide epitope 
sequence is substituted for the scaffold sequence. It is known that alteration of an allergen to 
reduce immunogenicity is unpredictable even when the epitopes responsible for binding IgE 
have been previously identified. Further, the specification has not adequately disclosed that the 
genus of allergen hybrid proteins could be used for diagnostic and/or therapeutic purposes. 

8. Claims 1, 3, 7-18 and 44 stand rejected under 35 U.S.C. 112, first paragraph, as
containing subject matter which was not described in the specification in such a way as to
reasonably convey to one skilled in the relevant art that the inventor(s), at the time the
application was filed, had possession of the claimed invention.

Applicant is in possession of: allergen hybrid proteins of wasp antigen 5 wherein peptide
epitopes greater than 8 amino acids in length obtained from a wasp antigen 5 protein of one 
species are used to replace the corresponding peptide epitopes in a wasp antigen 5 protein of a 
different wasp species and the resulting hybrid molecule demonstrates reduced allergenicity but 
maintains immunogenicity and maintains the conformation present in a wild-type wasp antigen 5 
protein. 

Applicant is not in possession of: an allergen hybrid protein comprising a scaffold
protein substituted with a peptide epitope sequence from an allergen protein that is 
structurally homologous to said scaffold protein, said hybrid protein has a native 
conformation when compared to said scaffold wherein the scaffold protein has at least 30 
percent sequence identity to the allergen protein from which the peptide epitope sequence is 
derived, wherein the peptide epitope sequence is 9 to 45 amino acids in length and is present in 
a surface accessible loop or corner region of the hybrid protein corresponding to its position in
the allergen protein, and wherein the allergen protein is selected from the group consisting of 
grass group 5 pollen allergens, grass group 2 allergens, dust mite group 1 allergens, dust 
mite group 2 allergens, Asterales group 5 allergens, Fagales group 1 allergens, Fagales 
group 2 allergens, Vespid antigen 5 allergens, and Bee venom group 2 allergens of claim 1; 
wherein the scaffold protein has at least 50 percent sequence identity to the allergen from 
which the peptide epitope sequence is derived of claim 3; wherein the peptide epitope 
sequence is 9 to about 35 amino acids in length of claim 7; wherein the peptide epitope 
sequence is 9 to about 25 amino acids in length of claim 8; wherein the peptide epitope 
sequence is 9 to about 15 amino acids in length of claim 9; which is a hybrid vespid venom
allergen protein of claim 12; which is a hybrid vespid venom antigen 5 protein of claim 13; 
wherein the peptide epitope sequence is from the genus Vespula and the scaffold protein is 
from the genus Polistes of claim 14; wherein the peptide epitope sequence is from the species 
vulgaris of claim 15; wherein the scaffold protein is from the species annularis of claim 16; a 
hybrid vespid venom antigen 5 allergen comprising scaffold sequence derived from a first 
vespid venom antigen 5 allergen comprising a substitution of a first epitope of said scaffold 
sequence with a second epitope derived from a second vespid venom antigen 5 allergen, said 
second epitope selected from the group consisting of NNYCKIKCLKGGVHTACK (SEQ ID: 2);
NNYCKIKCLKGGVHTACKYGSLKP (SEQ ID: 3);
NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVV (SEQ ID: 4);
NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQ (SEQ ID: 5);
NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQEKQDILK (SEQ ID: 6);
QVGQNVALTGSTAAKYDDPVKLVKMWEDEVKDYNPKKKFSGNDFLKTG (SEQ ID NO:
7); HYTQMVWANTKEVGCGSIKYIQEKWHKHYLVCNYOPSGNFKNEELYQTK
(SEQ ID NO: 8); LKPNCGNKVVV (SEQ ID NO: 9); LTGSTAAKYDD (SEQ ID NO: 10);
PKKKFSGND (SEQ ID NO: 11); FKNEELYQTK (SEQ ID NO: 13);
NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQEKQDILKEHND 
(SEQ ID NO: 93); 

NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQEKQDILKEHND 
FRQKIAR (SEQ ID NO: 94); and 

NNYCKIKCLKGGVHTACKYGSLKPNCGNKVVVSYGLTKQEKQDILKEHND
FRQKIARGLETRGNPGPQPPAKNMKN (SEQ ID NO: 95) and wherein said hybrid vespid
venom antigen 5 allergen has a native conformation of claim 17; wherein the peptide epitope 
sequence comprises a conservative amino acid change of claim 18 and wherein the peptide 
epitope sequence is from an allergen selected from the group consisting of Dol m 5, Dol a 5, 
Pol a 5, 5, Pol e 5, Pol f 5, Pol m 5, Vesp c 5, Vesp m 5, Ves f 5, Ves g 5, Ves m 5, Ves p 5, Ves
s 5, Ves vi 5, and Ves v 5 for the same reasons as set forth in the Office Action mailed on 
02/27/2008. 



Applicant's arguments filed on 08/27/2008 have been fully considered, but are not found 
persuasive. 



Applicant argues: 

"As set forth above in connection with the response to the enablement rejection, the claims have 
been directed to hybrid allergen proteins from classes of allergens that are well characterized by both 
sequence and three-dimensional structure and in which the scaffolding protein and the allergen share at 
least 30% sequence identity. As set forth above, using the known three dimensional structure of at least one 
member of the grass group 5 allergens, grass group 2 allergens, dust mite group 1 allergens, dust mite 
group 2 allergens, Asterales group 5 allergens, Fagales group 1 allergens, Fagales group 2 allergens, Vespid 
antigen 5 allergens, and Bee venom group 2 antigen allergens, and routine and predictable sequence 
alignment programs, one of ordinary skill in the art could recognize the sequence of any of the hybrid grass 
group 5 allergens, grass group 2 allergens, dust mite group 1 allergens, dust mite group 2 allergens, 
Asterales group 5 allergens, Fagales group 1 allergens, Fagales group 2 allergens, Vespid antigen 5 
allergens, and Bee venom group 2 antigen allergens encompassed by the claims. The respective sequences 
provide sufficient physical characteristics to comply with the written description requirement. " 

It remains the Examiner's position that the specification does not disclose a correlation
between the structure of the hybrid allergen and the functional limitations ("an allergen protein that is
structurally homologous to said scaffold protein"; "has a native conformation when compared to 
said scaffold protein"; "wherein the scaffold protein has at least 30 percent sequence identity to 
the allergen from which the peptide epitope sequence is derived"; "is present in a surface
accessible loop or corner region of the hybrid protein corresponding to its position in the allergen 
protein" of claim 1; "wherein the scaffold protein has at least 50 percent sequence identity to the 
allergen from which the peptide epitope sequence is derived" of claim 3; "wherein the peptide 
epitope sequence is 9 to 35 amino acids in length" of claim 7; "wherein the peptide epitope 
sequence is 9 to 25 amino acids in length" of claim 8; "wherein the peptide epitope sequence is 9 
to 15 amino acids in length" of claim 9; "which is a hybrid vespid venom allergen protein" of 
claim 12; "which is a hybrid vespid venom antigen 5 protein" of claim 13; "wherein the peptide
epitope sequence is from the genus Vespula and the scaffold protein is from the genus Polistes"
of claim 14; "wherein the peptide epitope sequence is from the species vulgaris" of claim 15;
"wherein the scaffold protein is from the species annularis" of claim 16; "comprising scaffold 
sequence derived from a first vespid venom antigen 5 allergen comprising a substitution of a first 
epitope of said scaffold sequence"; and "wherein said hybrid vespid venom antigen 5 allergen 
has a native conformation" of claim 17; and "comprises a conservative amino acid change" of 
claim 18) such that a skilled artisan would have known what allergen hybrid proteins attain the 
claimed function and functional limitations. "Possession may not be shown by merely
describing how to obtain possession of members of the claimed genus or how to identify their
common structural features" Ex parte Kubin (83 U.S.P.Q.2d 1410 (BPAI 2007)), at page 16. In
the instant case Applicants have not provided any guidance as to what allergens, peptide epitope
sequences and scaffold proteins will result in the claimed functional limitations. "Without a 
correlation between structure and function, the claim does little more than define the claimed 
invention by function" supra, at page 17. 
Applicant's argument that the claims have been directed to hybrid allergen proteins from
classes of allergens that are well characterized by both sequence and three-dimensional structure 
and in which the scaffolding protein and the allergen share at least 30% sequence identity and 
using the known three dimensional structure of the allergens and routine and predictable 
sequence alignment programs, one of ordinary skill in the art could recognize the sequence of 
any of the hybrid allergens encompassed by the claims is not sufficient. The specification must 
also set forth the structural features that allow one of ordinary skill in the art to produce allergen 
hybrid proteins with the required functional limitations: "an allergen protein that is structurally 
homologous to said scaffold protein"; "has a native conformation when compared to said 
scaffold protein"; "wherein the scaffold protein has at least 30 percent sequence identity to the 
allergen from which the peptide epitope sequence is derived"; "is present in a surface accessible 
loop or corner region of the hybrid protein corresponding to its position in the allergen protein" 
of claim 1; "wherein the scaffold protein has at least 50 percent sequence identity to the allergen 
from which the peptide epitope sequence is derived" of claim 3; "wherein the peptide epitope 
sequence is 9 to 35 amino acids in length" of claim 7; "wherein the peptide epitope sequence is 9 
to 25 amino acids in length" of claim 8; "wherein the peptide epitope sequence is 9 to 15 amino 
acids in length" of claim 9; "which is a hybrid vespid venom allergen protein" of claim 12;
"which is a hybrid vespid venom antigen 5 protein" of claim 13; "wherein the peptide epitope
sequence is from the genus Vespula and the scaffold protein is from the genus Polistes" of claim
14; "wherein the peptide epitope sequence is from the species vulgaris" of claim 15; "wherein the
scaffold protein is from the species annularis" of claim 16; "comprising scaffold sequence 
derived from a first vespid venom antigen 5 allergen comprising a substitution of a first epitope
of said scaffold sequence"; and "wherein said hybrid vespid venom antigen 5 allergen has a 
native conformation" of claim 17; and "comprises a conservative amino acid change" of claim 
18, but there is no guidance on mutant hybrid allergens with these properties. The working 
examples are not sufficient to support the genus of allergen hybrid proteins encompassed by the 
claimed invention. In the instant case, definition by functional limitations does not suffice to 
define the genus because it is only an indication of what functional properties it has, rather than 
what it is. 



9. No claim is allowed. 

10. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time 
policy as set forth in 37 CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within TWO 
MONTHS of the mailing date of this final action and the advisory action is not mailed until after
the end of the THREE-MONTH shortened statutory period, then the shortened statutory period 
will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 
CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, 
however, will the statutory period for reply expire later than SIX MONTHS from the mailing 
date of this final action. 



11. Any inquiry concerning this communication or earlier communications from the
examiner should be directed to Nora M. Rooney whose telephone number is (571) 272-9937. 
The examiner can normally be reached Monday through Friday from 8:30 am to 5:00 pm. A 
message may be left on the examiner's voice mail service. If attempts to reach the examiner by 
telephone are unsuccessful, the examiner's supervisor, Eileen O'Hara, can be reached on (571)
272-0878. The fax number for the organization where this application or proceeding is assigned 
is 571-273-8300.



Information regarding the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Status information for published applications 
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private 
PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 

December 4, 2008 
Nora M. Rooney, M.S., J.D. 
Patent Examiner 
Technology Center 1600 

/Maher M. Haddad/ 

Primary Examiner, Art Unit 1644

Science's compass



TECHVIEW: GENOMICS 



The Babel of Bioinformatics 



Teresa K. Attwood



The sequencing of entire genomes is a major achievement, but the meaning of the mass of
accumulated data is only just beginning to be unraveled. At first sight, the task appears
straightforward: locate the genes and translate the coding regions to establish their protein
products; perform similarity searches to establish relationships with previously characterized
sequences and assign function by evolutionary inference; and rationalize the function in
structural terms using known or model-derived structures. Given the quantity of data, the
procedures should be automated as much as possible.

The reality, of course, is not so simple. At- 
tempts to decipher the clues latent in genom- 
ic data are hampered because current meth- 
ods to predict genes in uncharacterized DNA 
are unreliable (and it is not always clear what 
we mean by "gene"); it is presumptuous to 
make functional assignments merely on the 
basis of some degree of similarity between 
sequences (and it is not always clear what we 
mean by "function"); very few structures are 
known compared with the number of se- 
quences, and structure prediction methods 
are unreliable (and knowing structure does 
not inherently tell us function); the degree of 
automation that has been used of necessity, 
with its imperfect tools and protocols, has led 
to the accumulation of much database misin- 
formation; and the terminology has been im- 
precise, muddying perceptions of what can 
realistically be achieved. Given these problems,
what is the state of the art in
sequence-structure-function bioinformatics?

Gene prediction 

Information used to predict genes includes 
signals in the sequence, content statistics, 
and similarity to known genes. In a recent 
test of gene detection tools on part of the 
Drosophila genome, the majority of these 
"gene finders" identified 95% of coding 
nucleotides, but intron/exon structures 
were correctly predicted for only about 
40% of genes. The different methods failed 
to find between 5% and 95% of genes, and 
incorrectly identified up to 55% (1). But
probably the most sobering evidence of the 
frailty of gene prediction methods is the 
uncertainty in the number of genes in the 
human genome, with current estimates 
ranging from 27,462 to 312,278. The methods
used to arrive at these numbers each
involve different approximations and ex- 
trapolations. Nevertheless, it is disturbing 
that the different analytical approaches 
should yield such disparate results. 

What is a gene? 

Perhaps the biggest obstacle to accurate 
gene counting is that even the definition of 
a gene is unclear. Is it a heritable unit cor- 
responding to an observable phenotype? 
Or is it a packet of genetic information 
that encodes a protein, or proteins? Or per- 
haps one that encodes RNA? Must it be 
translated? Are genes genes if they are not 
expressed? As definitions vary, inevitably 
so do estimates of the total number of 
genes in sequenced genomes. 

The sequence-structure imbalance

To date, more than 540,000 protein sequences 
have been deposited in the nonredundant 
database maintained by the National Center 
for Biotechnology Information (NCBI), and 
millions of expressed sequence tags (ESTs), 
which are partial sequences of clones that are 
often error prone, are housed in public and 
proprietary repositories. These numbers will 
snowball with the fruition of further genome 
projects. By contrast, the number of unique 
protein structures is still less than 2000. Of 
course, we do not know how many unique se- 
quences there are; nevertheless, it is clear that 
there is a dearth of structural information. 

Given this sequence-structure imbalance, 
it is imperative that we focus on deciphering 
the structural, functional, and evolutionary 
clues encoded in the language of biological 
sequences. Two distinct analytical approach- 
es have emerged. Pattern recognition meth- 
ods aim to detect similarity between se- 
quences and structures and infer related 
functions. Thus, they require some charac- 
teristic to have been observed and deposited 
in a reference database. In contrast, ab initio 
Levels of complexity. Looking at a sequence (A) or a fold (B) in isolation, we can say little about
its function. Only when we look at sequences or structures together do the patterns of conservation
that emerge (motifs) begin to provide functional clues. For example, the above motifs (C) may
suggest roles in calcium binding, nucleotide binding, and membrane anchoring. We can think of
folds as providing different scaffolds, which can be decorated in different ways by different sequences
to confer different functions. Knowing both the fold and the function allows us to rationalize
the mechanism by which the structure effects its function at the molecular level.
prediction methods deduce structure directly
from sequence. The approaches are quite
different and should not be confused. Their
levels of success also differ markedly. 

Function prediction through pattern 
recognition 

Tools for similarity searching are standard 
components of the sequence annotator's ar- 
mory. Sequence similarity programs may seek 
pairwise similarities in large sequence reposi- 
tories or search for conserved patterns in gene 
family databases (2-5). Gene family databas- 
es allow more specific functional diagnoses to 
be made than is possible by pairwise searching.
They are based on the principle that related
sequences can be aligned to find regions
(motifs) that show little variation. These mo- 
tifs usually reflect some vital structural or 
functional role (see the figure), and they can 
be used to derive diagnostic family signatures. 
Sequences can then be searched against 
databases of such signatures to see whether 
they can be assigned to known families. Gene 
family databases have recently been integrat- 
ed to create a unified protein family resource 
(6), facilitating the inference of function by 
identifying homologous relationships. 
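
The signature-searching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it handles the basic elements of PROSITE-style pattern syntax (`x` for any residue, `[..]` for alternatives, `{..}` for exclusions, `(n)` or `(n,m)` for repeats, `<` and `>` as terminal anchors), the function names `prosite_to_regex` and `find_motif` are hypothetical, and the query sequence is an illustrative fragment rather than data from the article. Real family databases use richer signatures (profiles, fingerprints, blocks) than a single regular expression.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # '<' pins the pattern to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Separate an optional repeat count such as x(4) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # '[ABC]' and single residue letters are already valid regex.
        repeat = ""
        if lo:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (1-based start, matched segment) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# The ATP/GTP-binding "P-loop" motif written in PROSITE style.
P_LOOP = "[AG]-x(4)-G-K-[ST]"
query = "MTEYKLVVVGAGGVGKSALTIQ"  # illustrative query fragment
print(find_motif(query, P_LOOP))   # → [(10, 'GAGGVGKS')]
```

A hit of this kind only says that the query carries the conserved motif; as the surrounding text stresses, it does not by itself establish homology or the function of the whole protein.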

The term "homology," a fundamental concept
in bioinformatics, is often used incorrectly.
Sequences are homologous if they are
related by divergence from a common ances- 
tor (7). Conversely, analogy relates to the ac- 
quisition of common structural or functional 
features via convergent evolution from unrelated
ancestors. For example, β barrels occur
in soluble serine proteases and integral mem- 
brane porins, but despite their common archi- 
tecture, they share no sequence or functional 
similarity. Similarly, the enzymes chy- 
motrypsin and subtilisin share groups of cat- 
alytic residues with almost identical spatial 
geometries, but they have no other sequence 
or structural similarities. Homology is not a 
measure of similarity, but rather an absolute 
statement that sequences have a divergent 
rather than a convergent relationship. This is 
not just a semantic issue because imprecise 
use of the term obscures evolutionary rela- 
tionships. In comparing structures, the same 
arguments apply. Structures may be similar, 
but common evolutionary origin remains a 
hypothesis until supported by other evidence; 
the hypothesis may be correct or mistaken, 
but the similarity is a fact (8). 

Among homologous sequences, we can 
distinguish orthologs (proteins that usually 
perform the same function in different 
species) and paralogs (proteins that perform 
different but related functions within one or- 
ganism). Orthologs allow investigation of 
cross-species relationships, whereas paralogs, 
which arise via gene duplication events, shed 
light on underlying evolutionary mechanisms 
because the duplicated genes follow separate 



evolutionary pathways and new specificities 
evolve through variation and adaptation. 
Such complexity presents real challenges for 
bioinformatics. When analyzing a database 
search, it may be unclear how much function- 
al annotation can be legitimately inherited by 
a query sequence, and whether the best 
match turned up by the search is the true or- 
tholog or a paralog. This difficulty is the 
source of numerous annotation errors. 

Further complications result from the 
domain and/or modular nature of many pro- 
teins. Modules are autonomous folding 
units that often function as protein building 
blocks, forming multiple combinations of 
the same module or mosaics of different 
modules. They can confer a variety of func- 
tions on the parent protein. If the best hit in 
a database search is a match to a single do- 
main or module, it is unlikely that the func- 
tion annotation can be propagated from the 
parent protein to the query sequence. 

In using modules to confer different 
functionalities, Nature uses old material to 
create new systems. The complexity of such 
systems poses important problems for com- 
putational approaches because the proper- 
ties of a system can be explained by but not 
deduced from those of its components (9,
10). The presence of a module tells little of
the function of the complete system; know- 
ing most components of a mosaic does not 
allow us easily to predict a missing one, and 
modules in different proteins do not always 
perform the same function. 

Many other factors also complicate 
function assignment: gene functions may 
be redundant, nonorthologous displace- 
ment can replace genes with unrelated but 
functionally analogous genes, horizontal 
gene transfer can introduce genes from 
different phylogenetic lineages, and lin- 
eage-specific gene loss can eliminate an- 
cestral genes. Thus, genomes harbor many 
obstacles to reliable function assignment. 

What is function? 

Protein function is context-dependent. Vagueness
in using the term has yielded confusing
database annotations. It is currently used to 
refer variously to biochemical activities, bio- 
logical goals, and cellular structure; for exam- 
ple, the function of actin might be described 
as "ATPase" or "constituent of the cytoskeleton."
In an attempt to introduce rigor into the
field and better reflect biological reality, independent
ontologies such as the Gene Ontology
(11) are under development that aim to define
more explicitly the relationships between
gene products and biological processes, 
molecular functions, and cellular components. 

Structure prediction and fold recognition 

We have seen that definitions of "genes"
differ, making it difficult to count genes 



accurately, and that our concepts of "func- 
tion" differ, making function assignment 
tricky. It would seem, however, that we can 
agree on what structures are. They are tan- 
gible, measurable things, so should we not 
be able to predict them reliably? 

Structure prediction methods range 
from computationally intensive strategies 
that simulate the physical and chemical 
forces involved in protein folding to 
knowledge-based approaches that use in- 
formation from structure databases to 
build models. Yet the problem of predict- 
ing protein structure remains unsolved: 
knowledge-based techniques typically pro- 
duce low-resolution models, and no cur- 
rent method yields reliable predictions for 
remote homologs (12). For small proteins,
ab initio methods generate models with 
substantial segments that resemble the cor- 
rect fold, but results deteriorate beyond 
~100 residues. Today, knowledge-based
methods, especially those that combine in- 
formation from different approaches, give 
best results (13). The most successful
modeling and fold recognition studies have 
balanced better algorithms with appropri- 
ate levels of manual intervention (14). 

Prediction methods do not work well 
because we do not fully understand how 
the primary structure of a protein deter- 
mines its tertiary structure. Structural ge- 
nomics projects will gradually lessen our 
reliance on prediction, because they aim to 
provide experimental structures or models 
for every protein in all completed genomes 
(although membrane protein structures 
will be difficult to obtain because they are 
difficult to crystallize). We must keep in 
mind, however, that structure alone will 
not inherently tell us function (see the figure).
For example, determining the structure
of a hypothetical protein and discovering
that it binds ATP (15) may shed light
on possible aspects of its functionality, but 
such information does not reveal its spe- 
cific biological function. 

What is structure? 

In the context of fold recognition and pre- 
diction, it is important to be precise about 
what we mean by "structure." For exam- 
ple, is a prediction a "good" prediction if 
it correctly reproduces all atomic posi- 
tions, the topology (connectivity of sec- 
ondary structures), the architecture (gross 
arrangement of secondary structures), or 
merely the structural class (mainly α,
mainly β, etc.)? Where does a "reasonably
good" prediction fall in this hierarchy, and 
what level of structural detail does a 
"tough near miss" (16) reveal? Using such 
imprecise words hinders comprehension, 
making it difficult to evaluate what a good 
prediction really is. 
Outlook

In "predicting" genes, protein functions, 
and structures, it is helpful to define our 
terms precisely and be honest about our 
achievements. Otherwise, we will continue 
to be baffled by paradoxical new prediction 
methods that yield >80% error rates. Gene 
identification, structure prediction, and 
functional inference are nontrivial compu- 
tational tasks, but with the relentless accu- 
mulation of sequence data, improvements 
continue to be made in all areas. 

Nature functions by integration, and the 
adoption of a more holistic view of complex 
biological systems is an essential next step 
for bioinformatics. To get the most from ge- 
nomic data, we need to take account of in- 
formation on the regulation of gene expres- 
sion, metabolic pathways, and signaling cas- 
cades. Proteins do not work in isolation but 
are involved in interrelated networks. Un- 
raveling these networks and their interac- 
tions will be vital to our understanding of 
normal and pathologic cell development, 
and will help us create an integrated map- 
ping between genotype and phenotype. 

Genomics-based drug discovery is 
heavily dependent on accurate functional 
annotation. Toward this end, bioinformatics 
will need to deliver highly integrated, inter- 
operable databases (and data "warehous- 
es") that allow the user to reason over dis- 
parate data sources and ultimately enable 
knowledge-based inference and innovation. 
The more genome annotation is automated, 
the greater will be the need for collabora- 
tion between software developers, annota- 
tors, and experimentalists. And the more 
data we have to handle, the more rigorous 
we must be in our thinking (and writing) if 
we are to make sense of the complexities. 
Sequence-structure-function bioinformat- 
ics does not yet yield all the answers, but a 
future holistic approach should help fuse 
today's glimmerings of knowledge into a 
new dawn of understanding. 
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t j c h s i i g h t i ng 
software 

Conquering by 
Dividing 

The average personal computer 
spends much less than half a day 
actually performing useful compu- 
tations. Many users, concerned about the 
vulnerability of expensive electronic 
components to the constant 
cycling of the power on and 
off, leave their systems on 
continuously. It is staggering 
to imagine the enormous, un- 
used computing resources of 
several million PCs left run- 
ning unattended. One popular 
approach to tapping this computing
power is the Search for Extra-Terrestrial
Intelligence (SETI) project (1),
which breaks giant computing problems 
into pieces that can be solved on personal 
computers in their spare time. 
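
The divide-and-distribute idea can be sketched as follows. This is a minimal illustration, not SETI's or Popular Power's software: the function names are hypothetical, a thread pool on one machine stands in for a fleet of volunteer PCs, and a sum of squares stands in for the real computation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(start, stop, chunk_size):
    """Cut the range [start, stop) into independent work units."""
    return [(lo, min(lo + chunk_size, stop))
            for lo in range(start, stop, chunk_size)]


def process_chunk(bounds):
    """Stand-in for the real computation: sum of squares over one work unit."""
    lo, hi = bounds
    return sum(n * n for n in range(lo, hi))


def run_distributed(start, stop, chunk_size, workers=4):
    """Hand each work unit to a worker, then combine the partial results."""
    chunks = split_into_chunks(start, stop, chunk_size)
    # Threads here are an assumption for illustration; in the schemes described
    # in the text, each chunk would be sent over the network to an idle PC.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


print(split_into_chunks(0, 10, 4))  # → [(0, 4), (4, 8), (8, 10)]
print(run_distributed(0, 1000, 64) == sum(n * n for n in range(1000)))  # → True
```

The approach only pays off when the work units are independent, so that no worker has to wait on another's result.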

Popular Power, Inc. is a company of- 
fering a new twist on this theme. Like 
SETI, a company computer feeds pieces 
of large computing problems to networked
personal computers via their software
program, Popular Power Worker, for
idle-time operation. Popular Power's ap- 
proach differs, however, in providing a 
variety of computing problems to work 
on. These include nonprofit projects with 
no financial incentive to the personal 
computer owner, as well as commercial 
jobs that will eventually pay users for 
tasks performed on their machines. 

The current version of the Popular Pow- 
er Worker runs only on Windows and Lin- 
ux systems and is officially in pre-release 
form. The preliminary status of the soft- 
ware is readily apparent; numerous bugs, 
frequent crashes, and difficulties in instal- 
lation plague the program currently. If in- 
formation at the company Web site is accu- 
rate, personal computer owners interested 
in Popular Power's computing model may 
find dealing with the problems of the early 
release worth their while. Users of the pre- 
release software are promised priority of 
access to commercial computing jobs after 
the official version is released. Popular 
Power Worker can be downloaded for free 
from the company's Web site, and it installs 
as a screen saver, which starts the program 
running when it becomes active. Future 
versions of the program for Macintosh and
Solaris systems are planned. 

The benefits of the Popular Power 
scheme for distributed computing tasks do 
not accrue solely to the user whose computer
is used. The flexible nature of Popular
Power's design provides access for
businesses, scientists, and anyone with 
massive computing projects to computing 
power that is potentially far greater than 
they would gain from a fixed piece of 
hardware. Personal computer users might 
be able to select which com- 
mercial job to run through 
Popular Power Worker de- 
pending on the return offered 
by the originating contractor. 
A key to the success of the 
computing model is likely to 
be the price Popular Power de- 
mands for acting as the inter- 
face between the computing project cre- 
ators and the personal computer users. 

In summary, the current version of 
Popular Power Worker is still in the testing 
phase and users may find the software unstable.
Tech-savvy personal computer enthusiasts
are best suited to test the current
pre-release product. The remaining users 
are advised to wait at least for the official 
release of the software. 

Kevin Ahern

Department of Biochemistry and Biophysics, Oregon 
State University, Corvallis, OR 97331, USA. E-mail:
ahernk@ucs.orst.edu

References 

1. J. Kaiser, Science 282, 839 (1998).

TECHSIGHTING
SOFTWARE 

Eyes on the Skies 

The orbital space above Earth contains
an astonishing collection of
man-made satellites. Tracking all of 
these objects is no small task. Liftoff* is a 
NASA Web site that provides several soft- 
ware tools to locate, track, and identify 
Earth-orbiting 
satellites. At the 
Web site, three pro- 
grams are available: 
J-Pass (identifies 
satellites passing 
overhead); J-Track 
(allows one to track 
orbiting objects); 
and J-Track 3D (al- 
lows one to view satellites orbiting Earth 
from a perspective far away in space). 
Each of these platform-independent appli- 
cations is written in Java and is accessible 
from both Internet Explorer and Netscape 



J-Track, J-Track 3D,
and J-Pass
Free

http://liftoff.msfc.nasa.gov/realtime/JTrack/Spacecraft.html
From genes to protein structure and function:
novel applications of computational 
approaches in the genomic era 

Jeffrey Skolnick and Jacquelyn S. Fetrow 

The genome-sequencing projects are providing a detailed 'parts list' of life. A key to comprehending this list is understanding
the function of each gene and each protein at various levels. Sequence-based methods for function prediction are inadequate 
because of the multifunctional nature of proteins. However, just knowing the structure of the protein is also insufficient for 
prediction of multiple functional sites. Structural descriptors for protein functional sites are crucial for unlocking the secrets 
in both the sequence and structural-genomics projects. 



Genome-sequencing projects are providing a 
detailed 'parts list' for life. Unfortunately, this list, 
a portion of which represents the amino acid 
sequence of all the proteins in a given genome, does 
not come with an instruction manual. That is, given 
the genome's sequences, one does not necessarily know
straight away which regions encode proteins, which 
serve a regulatory role and which are responsible for 
the structure and replication of the DNA itself. 

This is not unlike giving a child a list of parts necessary
to create a working automobile. Without the
necessary expertise, creating the final, working car from 
just the initial parts list is a nearly impossible task. Simi- 
larly, understanding how to create a complete, func- 
tioning cell given just the sequence of nucleotides 
found in an organism's genome is a complex problem.

What is a protein function? 

After a genome is sequenced and its complete parts 
list determined, the next goal is to understand the func- 
tion(s) of each part, including that of the proteins. What 
do we mean by protein function, the focus of this article? 

Function has many meanings. At one level, the pro- 
tein could be a globular protein, such as an enzyme, 
hormone or antibody, or it could be a structural or 
membrane-bound protein. Another level is its bio- 
chemical function, such as the chemical reaction and 
the substrate specificity of an enzyme. The regulatory 
molecules or cofactors that bind to a protein are also 
levels of biochemical function. 

At the cellular level, the protein's function would
involve its interaction with other macromolecules and
the function and cellular location of such complexes. 
There is also the protein's physiological function; that 
is, in which metabolic pathway the protein is involved 
or what physiological role it performs in the organism. 
Finally, the phenotypic function is the role played by 
the protein in the total organism, which is observed by 
deleting or mutating the gene encoding the protein. 

J. Skolnick (skolnick@danforthcenter.org) is at the Danforth Plant Science
Center, Laboratory of Computational Genomics, 4041 Forest Park
Avenue, St Louis, MO 63108, USA. J.S. Fetrow is at GeneFormatics,
Suite 200, 5830 Oberlin Drive, San Diego, CA 92121-3754, USA.



Obviously, the complete characterization of protein 
function is difficult but efforts are under way at all levels 1-4,
including cellular function 5,6. In this article, however,
we focus on identifying the biochemical function of a 
protein given its sequence, a problem that is amenable to 
molecular approaches. 

Sequence-based approaches to function 
prediction 

The sequence-to-function approach is the most com- 
monly used function-prediction method. This robust 
field is well developed and, in the interest of space 
limitations, we will merely present a brief overview. 

There are two main flavors of this approach: sequence 
alignment 7-9; and sequence-motif methods such as
Prosite 10, Blocks 11, Prints 12,13 and Emotif 14. Both the
alignment and the motif methods are powerful but a 
recent analysis has demonstrated their significant limi- 
tations 15 , suggesting that these methods will increasingly 
fail as the protein-sequence databases become more 
diverse. 

An extension of these approaches that combines 
protein-sequence with structural information has been 
developed and some successes have been reported 16 . 
However, this method still applies the structural infor- 
mation in a one-dimensional, 'sequence-like' fashion 
and fails to take into account the powerful three- 
dimensional information displayed by protein structures. 

In addition, proteins can gain and lose function dur- 
ing evolution and may, indeed, have multiple functions 
in the cell (Box 1). Sequence-to-function methods 
cannot specifically identify these complexities. Inaccu- 
rate use of sequence-to-function methods has led to 
significant function-annotation errors in the sequence 
databases 17 . 

An alternative approach 

An alternative, complementary approach to protein- 
function prediction uses the sequence-to-structure-to- 
function paradigm. Here, the goal is to determine the 
structure of the protein of interest and then to identify 
the functionally important residues in that structure. 
Using the chemical structure itself to identify functional 
sites is more in line with how the protein actually works. 
In a sense, this is one long-term goal of 'structural
genomics' projects 18,19, which are designed to determine
all possible protein folds experimentally, just
as genome-sequencing projects are determining all 
protein sequences 20 . This is in contrast to traditional 
structural-biology approaches, in which one knows the 
protein's function first and only then, if the function is 
sufficiently important, determines its structure. 

It is implicitly assumed that having the protein's structure
will provide insights into its function, thereby furthering
the goals of the human-genome-sequencing
project. However, knowing a protein's three-dimensional 
structure is insufficient to determine its function 
(Box 2). What we really need to analyse and predict the 
multifunctional aspects of proteins is a method spe- 
cifically to recognize active sites and binding regions in 
these protein structures. 

Active-site identification 

In order to use a structure-based approach to function 
prediction, one must identify the key residues respon- 
sible for a given biochemical activity. For many years, 
it has been suggested that the active sites in proteins are 
better conserved than the overall fold. Taken to the 
limit, this suggests that one could not only identify dis- 
tant ancestors with the same global fold and the same 
activity but also proteins with similar functions but 
distantly related, or possibly unrelated, global folds.

The validity of this suggestion was demonstrated 
empirically by Nussinov and co-workers, who showed 
that the active sites of eukaryotic serine proteases, sub- 
tilisins and sulfhydryl proteases exhibit similar structural 
motifs 21 . Furthermore, in a recent modeling study of 
Saccharomyces cerevisiae proteins, protein functional sites 
were found to be more conserved than other parts of 
the protein models 22. Similarly, it has been demonstrated
that the catalytic triad of the α/β hydrolases
is structurally better conserved than other histidine- 
containing triads 23 . A comparison of the structure of the 
hydrolase catalytic triad to other histidine-containing 
triads shows a distinct bimodal distribution, while a 
similar analysis done with a randomly selected triad shows 
a unimodal distribution (Fig. 1). 

Kasuya and Thornton 24 generalized this example by 
creating structural analogs of a few Prosite sequence 
motifs 10. For the 20 most-frequently occurring Prosite
patterns, the associated local structure is quite distinct. 
These results provide clear evidence that enzyme active 
sites are indeed more highly conserved than other parts
of the protein. 

Identifying active sites in experimental structures 

Historically, several groups have attempted to iden- 
tify functional sites in proteins; these efforts were 
directed at protein engineering or building functional 
sites in places where they did not previously exist. This 
has been successfully accomplished for several metal-
binding sites 25-33 . However, highly accurate functional- 
site descriptors of the backbone and side-chain atoms were 
required, fueling the belief that significant atomic detail 
is required in site descriptors for function identification. 

Highly detailed residue side-chain descriptors of the 
active sites of serine proteases and related proteins have 
been used to identify functional sites 3 . The use of these 
highly detailed motifs has led to the identification of 



Box 1. Proteins are multifunctional 



A common protein characteristic that makes functional analysis based 
only on homology especially difficult is the tendency of proteins to be 
multifunctional. For instance, lactate dehydrogenase binds NAD, sub- 
strate and zinc, and performs a redox reaction. Each of these occurs 
at different functional sites that are in close proximity and the combi- 
nation of all four sites creates the fully functional protein. 

Other examples of multifunctional proteins are the nucleic-acid-binding
proteins. For instance, DNA regulatory proteins often contain a DNA- 
binding domain, a multimerization domain and additional sites that bind 
regulatory proteins; a classic example is RecA 59 . The 3C rhinovirus 
protease exhibits a proteolytic function as well as an RNA-binding
function 60 - 61 . Transcription factors are also complex, multifunctional 
proteins 62 . It is becoming increasingly important to recognize each of 
these different functions of gene products of a newly sequenced gene. 

The serine-threonine-phosphatase superfamily is a prime example of
the difficulties of using standard sequence analysis to recognize the 
multiple functions found in single proteins. This large protein family is 
divided into a number of subfamilies, all of which contain an essential 
phosphatase active site. Subfamilies 1, 2A and 2B exhibit 40% or more
sequence identity between them 63 . However, each of these subfamilies 
is apparently regulated differently in the cell 64 - 67 and observation sug- 
gests that there are different functional sites at which regulation can 
occur. Because the sequence identity between subfamilies is so high, 
standard sequence-similarity methods could easily misclassify new 
sequences as members of the wrong subfamily if the functional sites 
are not carefully considered, as was recently demonstrated 43 . 

These are but a few examples of the multifunctionality of proteins. 
The recognition of this multifunctional nature is of critical importance 
to the genomics field. Useful functional-annotation methods must con- 
sider all of the specific functions in a given protein and will not just 
provide a general classification of function. 



several novel functional sites in known, high-quality 
protein structures 3,34. More automated methods for
finding spatial motifs in protein structures have also
been described 21,34-40.

Unfortunately, most of these methods require the 
exact placement of atoms within protein backbones and 
side chains, and so have not been shown to be relevant 
to inexact predicted structures. Recently, however, we
described the production of fuzzy, inexact descriptors 
of protein functional sites 15 . As we wish to apply the 
descriptors to experimental structures as well as to predicted
protein models, we used only α-carbon atoms and
side-chain centers-of-mass positions. We call these 
descriptors 'fuzzy functional forms' (FFFs) and have 
created them for both the disulfide-oxidoreductase 15 - 41 
and α/β-hydrolase catalytic active sites 23.
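
To make the idea of an inexact, distance-based site descriptor concrete, the sketch below scans a structure for residue combinations whose pairwise distances fall within a tolerance of target values. It is not the authors' FFF implementation: the function name `find_sites`, the descriptor layout, the tolerance, and every coordinate and distance are invented for illustration. It uses one point per residue (as an α-carbon position would provide) and omits side-chain centers of mass for brevity.

```python
from itertools import product
from math import dist


def find_sites(structure, descriptor, tolerance=1.0):
    """Return index tuples of residues that satisfy a coarse site descriptor.

    structure  : list of (residue_type, (x, y, z)), one point per residue
    descriptor : {"residues": (type0, type1, ...),
                  "distances": {(slot_a, slot_b): target_distance, ...}}
    tolerance  : allowed deviation from each target distance; a loose value
                 is what lets the descriptor tolerate inexact models
    """
    slots = descriptor["residues"]
    # Candidate positions in the structure for each slot of the descriptor.
    candidates = [[i for i, (aa, _) in enumerate(structure) if aa == wanted]
                  for wanted in slots]
    hits = []
    for combo in product(*candidates):
        if len(set(combo)) < len(combo):
            continue  # the same residue cannot fill two slots
        if all(abs(dist(structure[combo[a]][1], structure[combo[b]][1]) - target)
               <= tolerance
               for (a, b), target in descriptor["distances"].items()):
            hits.append(combo)
    return hits


# Hypothetical toy data: coordinates and target distances are made up.
toy_structure = [
    ("G", (0.0, 0.0, 0.0)),
    ("S", (3.8, 0.0, 0.0)),
    ("H", (3.8, 6.0, 0.0)),
    ("A", (20.0, 20.0, 20.0)),
    ("D", (3.8, 6.0, 8.0)),
    ("S", (30.0, 0.0, 0.0)),
]
toy_descriptor = {
    "residues": ("S", "H", "D"),
    "distances": {(0, 1): 6.0, (1, 2): 8.0, (0, 2): 10.0},
}
print(find_sites(toy_structure, toy_descriptor))  # → [(1, 2, 4)]
```

Widening the tolerance trades specificity for robustness to coordinate error, which is the trade-off the text describes when it contrasts exact atomic descriptors with fuzzy ones.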

The disulfide-oxidoreductase FFF was applied to 
screen high-resolution structures from the Brookhaven
protein database 42 . In a dataset of 364 protein structures, 
the FFF accurately identified all proteins known to 
exhibit the disulfide-oxidoreductase active site 15 . In a 
larger dataset of 1501 proteins, the FFF again accurately
identified all proteins with the active site. In addition, 
it identified another protein, 1fjm, a serine-threonine
phosphatase. This result was initially discouraging but 
subsequent sequence alignment and clustering analysis 
strongly suggested that this putative site might indeed 
be a site of redox regulation in the serine-threonine 
phosphatase-1 subfamily 43 . If confirmed by experiment, 
this result will highlight the advantages of using struc- 
tural descriptors to analyse multiple functional sites in 
proteins. It will also highlight the fact that human 
Box 2. Knowing a protein's structure does not necessarily
tell you its function 



Because proteins can have similar folds but different functions 68,69,
determining the structure of a protein may or may not tell you something
about its function. The most well-studied example is the (α/β)8
barrel enzymes, of which triose-phosphate isomerase (TIM) is the archetypal
representative. Members of this family have similar overall structures
but different functions, including different active sites, substrate
specificities and cofactor requirements 70 - 71 . 

Is this example common? Our own analysis of the 1997 SCOP data- 
base 68 shows that the five largest fold families are the ferredoxin- 
like, the (α/β) barrels, the knottins, the immunoglobulin-like and the
flavodoxin-like fold families with 22, 18, 13, 9 and 9 subfamilies, respectively
(Fig. i). In fact, 57 of the SCOP fold families consist of multiple
superfamilies. These data only show the tip of the iceberg, because 
each superfamily is further composed of protein families and each indi- 
vidual family can have radically different functions. For example, the 
ferredoxirdike superfamily contains families identified as Fe-S ferredoxins, 
ribosomal proteins, DNA-binding proteins and phosphatases, among 
others. 

After this article was submitted, a much more detailed analysis of the 
SCOP database was published 72 . This finds a broad function-structure 
correlation for some structural classes, but also finds a number of 
ubiquitous functions and structures that occur across a number of families. 
The article provides a useful analysis of the confidence with which 
structure and function can be correlated 72 . Knowing the protein structure 
by itself is insufficient to annotate a number of functional classes 
and is also insufficient for annotating the specific details of protein 
function. 

Figure i 

Histogram of the numbers of superfamilies found in each SCOP fold family. 
These data clearly show that proteins with similar structures can have different 
functions and demonstrate the difficulty of assigning protein function based 
simply on the three-dimensional structure. The data were taken from the 1997 
distribution of SCOP (http://scop.mrc-lmb.cam.ac.uk/scop). For a more detailed 
analysis, see Ref. 72. 



observation alone is no longer adequate for identifying 
all functional sites in known protein structures. 

To date, the use of structure to identify function has 
largely focused on high-resolution structures and highly 
detailed descriptors of protein functional sites. How- 
ever, the creation of inexact descriptors for functional 
sites opens the way to the application of these methods 
to inexact, predicted protein models. The question 
remains: how good does a model have to be in order 
to use FFFs to identify its active sites? 



The state of the art in structure-prediction 
methods 

For proteins whose sequence identity to a protein of known 
structure is above ~30%, one can use homology modeling to build 
the structure 44 . However, structure prediction is far more difficult 
for proteins that are not homologous to proteins with 
known structure. At present, there are two approaches for 
these sequences: ab initio folding 45-48 and threading 49-53 . 

In ab initio folding, one starts from a random confor- 
mation and then attempts to assemble the native struc- 
ture. As this method does not rely on a library of 
pre-existing folds, it can be used to predict novel 
folds. The recent CASP3 protein-structure-prediction 
experiment (http://PredictionCenter.llnl.gov/CASP3) 
involved the blind prediction of the structure of proteins 
whose actual structure was about to be experimentally 
determined. The results indicate that considerable 
progress has been made 46,54 . For helical and 
α/β proteins with fewer than 110 residues, structures 
were often predicted whose backbone root-mean-square 
deviation (RMSD) from native ranged from 
4-7 Å. Progress is being made with the β proteins, too, 
although they remain problematic. Because ab initio 
methods can identify novel folds, these methods could 
be used to help to select sequences likely to yield novel 
folds in experimental structural-genomics projects. 
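
The backbone RMSD figures quoted above can be made concrete with a short sketch. It assumes that the predicted and native coordinates have already been optimally superimposed and paired atom by atom; the superposition step itself is omitted, and the function name is chosen here for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired coordinate sets.

    Assumes the sets are already superimposed and that coords_a[i]
    corresponds to coords_b[i]. The result is in the units of the
    inputs (angstroms for typical structure files).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Two toy backbone atoms displaced by 3 and 4 units, respectively.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
native = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(round(rmsd(model, native), 3))
```

Because the deviations are squared before averaging, a few badly placed regions such as loops can dominate the overall value even when the core of a model is well predicted.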

Another approach to tertiary-structure prediction is 
threading. Here, for the sequence of interest, one 
attempts to find the closest matching structure in a 
library of known folds 52,55 . Threading is applicable to 
proteins of up to 500 residues or so and is much faster 
than ab initio approaches. However, threading cannot 
be used to obtain novel folds. 

Ab initio predicted models can be used for automatic 
protein-function prediction 

The results of the recent CASP3 competition sug- 
gest that current modeling methods can often (but not 
always) create inexact protein models. Are these struc- 
tures useful for identifying functional sites in proteins? 
Using the ab initio structure-prediction program 
MONSSTER, the tertiary structure of a glutaredoxin, 
1ego, was predicted 56 . For the lowest-energy model, 
the overall backbone RMSD from the crystal structure 
was 5.7 Å. 

To determine whether this inexact model could be 
used for function identification, the sets of correctly 
and incorrectly folded structures were screened with 
the FFF for disulfide-oxidoreductase activity 15 . The 
FFF uniquely identified the active site in the correctly 
folded structure but not in the incorrectly folded ones 
(Fig. 2). This is a proof-of-principle demonstration that 
inexact models produced by ab initio prediction of 
structure from sequence can be used for the subsequent 
prediction of biochemical function. Of course, improvements 
in the method have to be made before such 
predictions can be done on a routine basis. 

Use of predicted structures from threading in 
protein-function prediction 

At present, practical limitations preclude folding an 
entire genome of proteins using ab initio methods 57 . 
Threading is more appropriate for achieving the requisite 
high-throughput structure prediction. Thus, a stand- 
ard threading algorithm 58 has been used to screen all 
proteins in nine genomes for the disulfide-oxidoreductase 
active site described above. 

First, sequences that aligned with the structures of 
known disulfide oxidoreductases were identified. Then, 
the structure was searched for matches to the active- 
site residues and geometry. For those sequences for 
which other homologs were available, a sequence- 
conservation profile was constructed 23 . If the putative 
active-site residues were not conserved in the sequence 
subfamily to which the protein belongs, that sequence 
was eliminated. Otherwise, the sequence was predicted 
to have the function. 
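
The final filtering step described above, in which a candidate is discarded when its putative active-site residues are not conserved within its sequence subfamily, might be sketched as follows. The alignment representation, the function name and the 0.8 conservation threshold are assumptions made for illustration; the article does not state how the conservation profile was scored.

```python
def site_is_conserved(alignment, allowed, min_fraction=0.8):
    """Check whether putative active-site columns are conserved.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    allowed: dict mapping a 0-based alignment column to the set of
        residue letters accepted at that column.
    min_fraction: hypothetical threshold on the fraction of sequences
        that must carry an accepted residue at every column.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    for column, letters in allowed.items():
        matching = sum(1 for seq in alignment if seq[column] in letters)
        if matching / len(alignment) < min_fraction:
            return False  # active-site residue not conserved: eliminate
    return True  # conserved: keep the functional prediction


# Toy subfamily alignment with putative cysteines at columns 1 and 4.
subfamily = ["ACGPC", "ACAPC", "ACTPC", "ASGPC"]
site = {1: {"C"}, 4: {"C"}}
print(site_is_conserved(subfamily, site))
print(site_is_conserved(subfamily, site, min_fraction=0.75))
```

In the toy alignment the first cysteine is present in three of four sequences, so the outcome depends on where the threshold is set; any real application would need to choose and justify that cutoff.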

Using this sequence-to-structure-to-function method, 
99% of the proteins in the nine genomes that have 
known disulfide-oxidoreductase activity have been 
found. From 10% to 30% more functional predictions 
are made than by alternative sequence-based approaches; 
similar results are seen for the α/β hydrolases 23 . Surprisingly, 
in spite of the fact that threading algorithms 
have problems generating good sequence-to-structure 
alignments, active sites are often accurately aligned, 
even for very distant matches. This observation would 
agree with the above experimental results indicating 
that active sites are well conserved in protein structures. 

Importantly, the false-positive rate when using structural 
information is much lower than that found using 
sequence-based approaches, as demonstrated by a 
detailed comparison of the FFF structural approach and 
the Blocks sequence-motif approach (N. Siew et al., 
unpublished). In this study, the sequences in eight 
genomes, including Bacillus subtilis, were analysed for 
disulfide-oxidoreductase function using the disulfide-oxidoreductase 
FFF, the thioredoxin Block 00194 and 
the glutaredoxin Block 00195. If we assume that those 
sequences identified by both the FFF and Blocks 
are 'true positives', we find 13 such sequences in the 
B. subtilis genome. 

There is no experimental evidence validating all of 
these 'true positives' and so they are more accurately 
termed 'consensus positives'. In order to find these 13 
'consensus positive' sequences, the FFF hits seven false 
positives. On the other hand, Blocks hits 23 false 
positives (Fig. 3). It was previously suggested that the 
use of a functional requirement adds information to 
threading and reduces the number of false positives 52 . 
These data, including the data shown in Fig. 3, validate 
this claim on a genome-wide basis. 
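
The comparison can be restated as the fraction of hits that are 'consensus positives', using only the B. subtilis counts given above (13 consensus positives, with seven false positives for the FFF and 23 for Blocks). The function name is chosen here for illustration, and because the positives are consensus rather than experimentally confirmed hits, the fractions are indicative only.

```python
def consensus_fraction(consensus_positives, false_positives):
    """Fraction of all hits that are consensus positives."""
    return consensus_positives / (consensus_positives + false_positives)


fff = consensus_fraction(13, 7)      # 13 of 20 FFF hits
blocks = consensus_fraction(13, 23)  # 13 of 36 Blocks hits
print(f"FFF: {fff:.2f}, Blocks: {blocks:.2f}")
```

On these counts, roughly two-thirds of the FFF hits are consensus positives, compared with just over one-third of the Blocks hits.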

Of course, as no genome has had the function of all 
of its proteins experimentally annotated, it is imposs- 
ible to know how many other proteins with the speci- 
fied biochemical function were not properly identified. 
This is a critical question for researchers attempting to 
predict protein function. Experimental confirmation 
will be needed to validate this or any other method 
fully. This points out the need for closely coupling 
computational function-prediction algorithms with 
experiments. 

Weaknesses of using the sequence-to-structure- 
to-function method of function prediction 

Based on studies to date, the identification of enzymatic 
activity requires a model in which the backbone 
RMSD from native near the active sites is about 4-5 Å. 
Predicted models are better at describing the geometry 
in the core of the molecule than in the loops and so 

Figure 1 

The distribution of root-mean-square deviations (RMSD) between the hydrolase 
catalytic triad and all other histidine-containing triads is bimodal 
(a); by contrast, the RMSD between a randomly selected (non-catalytic) triad and all 
other histidine-containing triads has a unimodal distribution (b). The His-Ser-Asp 
catalytic triad in the protein 1gpl (Rp2 lipase) (a) and a random histidine-containing 
triad from 4pga (glutaminase-asparaginase) (b) were structurally aligned to all 
His-containing triads in a database of 1037 proteins 23 . Actual α/β-hydrolase active sites 
(a) and the 4pga site (b) are indicated by blue bars; other histidine triads that are 
not active sites are indicated by red bars. None of the sites found by matching to the 
4pga were hydrolase active sites. Inset graphs show the full distribution. 

predicting the function of a protein whose active site is 
in loops may be a problem. Also, the method can currently 
only be applied to enzyme active sites; substrate- 
and ligand-binding sites have not been identified using 
the inexact models. Techniques that will further refine 
inexact protein models will be quite useful in taking 
the protein analysis to the next step. 

Conclusions 

Although sequence-based approaches to protein- 
function prediction have proved to be very useful, alter- 
natives are needed to assign the biochemical function 
of the 30-50% of proteins whose function cannot be 
assigned by any current methods. One emerging 
approach involves the sequence-to-structure-to-function 
paradigm. Such structures might be provided by struc- 
tural-genomics projects or by structure-prediction 
algorithms. Functional assignment is made by screen- 
ing the resulting structure against a library of structural 
descriptors for known active sites or binding regions. 

Figure 2 

Application of the disulfide-oxidoreductase fuzzy functional form (FFF) to ab initio 
models of glutaredoxin created by the program MONSSTER shows that the FFF can 
distinguish between correctly folded and misfolded (or higher-energy) models. The FFF 
is shown as two orange balls (representing the cysteines) and a blue ball (represent- 
ing the proline). The protein models are shown as magenta wire models with the active- 
site cysteines and proline shown as yellow and cyan balls, respectively. The FFF clearly 
distinguishes the correct active site in the crystal structure of the glutaredoxin 1ego 
and the correctly folded, lowest-energy model. The FFF does not match to the active 
sites of any of the higher energy, misfolded structures, four of which are shown here. 

Figure 3 

Analysis of the Bacillus subtilis genome using the thioredoxin Block 00194. The Blocks 
score (computed using the publicly available BLIMPS program) is plotted on the x axis 
and the number of sequences found in each scoring bin is plotted on the y axis. Those 
sequences identified as 'consensus positives' [identified by both the fuzzy functional 
form (FFF) and the Block] are shown as red bars. One additional sequence found by 
the FFF, which is likely to be a true positive, is shown as a blue bar. All other 
sequences, 'putative false positives', are shown as yellow bars. Using the Blocks 
score at which all 13 of the 'consensus positives' are found, 23 false positives are 
also found. In its analysis of the B. subtilis genome, the FFF identifies only seven false 
positives along with the same 13 'consensus positives' (data not shown). 



Detailed descriptors will only work on the experi- 
mentally determined, high-quality structures. Ideally, 
however, the descriptors should work on both experi- 
mental structures and the cruder models provided by 
tertiary-structure-prediction algorithms. 

The advantages of such an approach are that one need 
not establish an evolutionary relationship in order to 
assign function, that more than one function can be 



assigned to a given protein [an issue of major impor- 
tance, because proteins are multifunctional (Box 1)] 
and, ultimately, that having a structure can provide 
deeper insight into the biological mechanism of pro- 
tein function and regulation. The disadvantages are that 
one needs to have the protein's structure before a function 
can be assigned and that the approach is limited to 
those functions associated with proteins with at least 
one solved structure, so that a functional-site descriptor 
can be constructed. 

In this sense, structure-to-function assignment can be 
thought of as 'functional threading': finding the active-site 
match in a library of descriptors for known protein 
active sites. This is the first step in the long process of 
using structure to assign all levels of function, a goal 
that is made increasingly important with the emergence 
of structural genomics. Based on the progress to date, 
it is apparent that structure will play an important role 
in the post-genomic era of biology. 

L1 2183 S INVASIN 

L2 1 S L1 AND CH3 

L3 0 S L1 AND CH-3 

EXPAND WANG C/AU 

L4 15978 S E3 OR E4 OR E5 OR E6 OR E7 OR E8 OR E9 OR E10 OR E11 OR E12 

EXPAND WALFIED A/AU 

L5 161 S E3 OR E4 OR E5 OR E6 OR E7 OR E8 OR E9 

L6 16139 S L4 OR L5 

L7 7 S L6 AND INVASIN 

L8 3 DUP REMOVE L7 (4 DUPLICATES REMOVED) 

L9 335313 S EPITOPE 

L10 13561 S L9 AND ALLERG? 

L11 4988041 S MUT? OR SUBSTITUT? 

L12 1534 S L10 AND L11 

L13 783 DUP REMOVE L12 (751 DUPLICATES REMOVED) 

L14 32 S L13 AND LOOP 

L15 0 S L13 AND CORNER 

L16 103 S L13 AND SURFACE 

L17 5 S L13 AND ACCESSIBLE 

L18 123 S L14 OR L15 OR L16 OR L17 

L19 123 DUP REMOVE L18 (0 DUPLICATES REMOVED) 

L20 2 S L19 AND IMMUNOGEN 

L21 2 S L19 AND HYBRID 

L22 0 S L19 AND MOSAIC 

L23 38 S L19 AND MUTANT? 

L24 42 S L20 OR L21 OR L22 OR L23 

EXPAND KING T/AU 

L25 2000 S E3 OR E4 OR E5 OR E6 OR E7 OR E8 OR E9 OR E10 OR E11 OR E12 

EXPAND SPANGFORT M/AU 

L26 298 S E3 OR E4 OR E5 OR E6 OR E7 OR E8 

L27 2298 S L25 OR L26 

L28 4 S L27 AND L24 



